Architecture

The major components of Hive and their interaction with Hadoop are:
  • User Interface
  • Driver
  • Compiler
  • Metastore
  • Execution Engine

Metastore – The metastore is the central repository of Apache Hive metadata. It stores the metadata for Hive tables and partitions (such as their schema and location) in a relational database and provides clients with access to this information through the metastore service API. This helps the driver keep track of the various data sets distributed over the cluster.
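The table metadata held by the metastore can also be read programmatically through its Thrift service API. Below is a minimal Java sketch using Hive's HiveMetaStoreClient; the metastore URI, database, and table name are assumptions, and the exact constructor arguments vary between Hive versions.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetastoreLookup {
        public static void main(String[] args) throws Exception {
            // Point the client at the metastore Thrift service (host and port are assumptions).
            HiveConf conf = new HiveConf();
            conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                // Fetch the stored metadata (schema and HDFS location) for a hypothetical table.
                Table table = client.getTable("default", "employees");
                System.out.println("Location: " + table.getSd().getLocation());
                for (FieldSchema col : table.getSd().getCols()) {
                    System.out.println(col.getName() + " : " + col.getType());
                }
            } finally {
                client.close();
            }
        }
    }
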

Driver – The driver acts as a controller that receives the HiveQL statements. It starts execution of a statement by creating a session and monitors the life cycle and progress of the execution. The driver stores the necessary metadata generated during the execution of a HiveQL statement and also acts as the collection point for the data or query result obtained after the Reduce operation.

Hive Clients
Hive supports different types of client applications for running queries. These clients fall into three categories:
  • Thrift Clients – Since the Apache Hive server is based on Thrift, it can serve requests from any language that supports Thrift.
  • JDBC Clients – Apache Hive allows Java applications to connect to it using a JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver (a connection sketch appears after this list).
  • ODBC Clients – The Hive ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
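As a concrete illustration of the JDBC client, here is a minimal Java sketch that connects to HiveServer2 and runs a query. It uses the HiveServer2 driver class org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL (the class named above belongs to the original HiveServer); the host, port, credentials, and table name are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcClient {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver (JDBC 4+ drivers also auto-register).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, database, credentials, and table name are assumptions.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "hiveuser", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM employees LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
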
Hive Services
Apache Hive provides various services, described below. Let us look at each in detail:
  • Beeline – Beeline is a command shell supported by HiveServer2, in which users can submit their queries and commands to the system. It is a JDBC client based on the SQLLine CLI (a pure-Java console utility for connecting to relational databases and executing SQL queries).
  • CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute your Hive queries and commands directly.
  • Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.
  • Hive Server – It is built on Apache Thrift and is therefore also called the Thrift Server. It allows different clients to submit requests to Hive and retrieve the final result.
  • Hive Driver – The driver receives the queries submitted by a Hive client through the Thrift, JDBC, ODBC, CLI, or Web UI interfaces.
  • Compiler – The driver then passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema present in the metastore.
  • Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks (the EXPLAIN sketch after this list shows such a plan).
  • Executor – Once compilation and optimization are complete, the execution engine runs these tasks in the order of their dependencies using Hadoop.
  • Metastore – A service that provides metastore access to the other Apache Hive services, with disk storage for the Hive metadata that is separate from HDFS storage.
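To see the plan of stages that the compiler and optimizer produce for a query, you can prefix the query with EXPLAIN. The sketch below runs EXPLAIN through the same JDBC interface; the connection details and the queried table are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ShowQueryPlan {
        public static void main(String[] args) throws Exception {
            // Connection details and the queried table are assumptions.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "hiveuser", "");
                 Statement stmt = con.createStatement();
                 // EXPLAIN prints the plan of stages produced by the compiler and optimizer.
                 ResultSet rs = stmt.executeQuery(
                     "EXPLAIN SELECT dept, count(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
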
How to process data with Apache Hive
Apache Hive data processing workflow.
Now we will discuss how a typical query flows through the system:
  • User Interface (UI) calls the execute interface to the Driver.
  • The driver creates a session handle for the query. Then it sends the query to the compiler to generate an execution plan.
  • The compiler needs metadata, so it sends a getMetaData request to the metastore and receives the sendMetaData response in return.
  • The compiler then uses this metadata to type-check the expressions in the query.
  • The compiler generates the plan, which is a DAG of stages, each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.
  • The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or intermediate output reads the rows from HDFS files and passes them through the associated operator tree.
  • Once the output is generated, it is written to a temporary HDFS file through the serializer. The temporary file feeds the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location; for queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the driver. Both paths are illustrated in the sketch below.
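The following Java sketch exercises both endings of this flow: an INSERT OVERWRITE, a DML operation whose final temporary file is moved into the table's location, followed by a SELECT whose result is returned through the driver's fetch call. The connection details and table names are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryFlow {
        public static void main(String[] args) throws Exception {
            // Connection details and table names are assumptions.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "hiveuser", "");
                 Statement stmt = con.createStatement()) {

                // DML path: the result is written to a temporary HDFS file and then
                // moved into the target table's location.
                stmt.execute("INSERT OVERWRITE TABLE dept_counts "
                           + "SELECT dept, count(*) FROM employees GROUP BY dept");

                // Query path: the fetch call reads the temporary result file from HDFS
                // and streams the rows back to the client.
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM dept_counts")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                    }
                }
            }
        }
    }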
